Abstract

While adaptive sensing has provided improved rates of convergence 
in sparse regression and classification, results in nonparametric regres- 
sion have so far been restricted to quite specific classes of functions. 
In this paper, we describe an adaptive-sensing algorithm which is ap- 
plicable to general nonparametric-regression problems. The algorithm 
is spatially-adaptive, and achieves improved rates of convergence over 
spatially-inhomogeneous functions. Over standard function classes, it 
likewise retains the spatial adaptivity properties of a uniform design. 

1 Introduction 

In many statistical problems, such as classification and regression, we
observe data $Y_1, Y_2, \dots$, where the distribution of each $Y_n$ depends
on a choice of design point $x_n$. Typically, we assume the $x_n$ are fixed in
advance. In practice, however, it is often possible to choose the design
points sequentially, letting each $x_n$ be a function of the previous
observations $Y_1, \dots, Y_{n-1}$.

We will describe such procedures as adaptive sensing, but they are also 
known by many other names, including sequential design, adaptive sampling, 
active learning, and combinations thereof. The field of adaptive sensing has 
seen much recent interest in the literature: compared with a fixed design, 
adaptive sensing algorithms have been shown to provide improvements in 
sparse regression (Fan and Lv, 2008; Haupt et al., 2011; Malloy and Nowak, 
2011b; Davenport and Arias-Castro, 2012) and classification (Cohn et al., 
1994; Castro and Nowak, 2007; Beygelzimer et al, 2009; Koltchinskii, 2010; 
Hanneke, 2011). Recent results have also focused on the limits of adaptive 
sensing (Arias-Castro et al., 2011; Malloy and Nowak, 2011a; Castro, 2012). 

Mathematics subject classification 2010. 62G08 (primary); 62L05, 62G20 (secondary). 
Keywords. Nonparametric regression, adaptive sensing, sequential design, active 
learning, spatial adaptation, spatially-inhomogeneous functions. 



In this paper, we will consider the problem of nonparametric regression,
where we aim to estimate an unknown function $f : [0,1] \to \mathbb{R}$ from
observations

$$Y_n := f(x_n) + \varepsilon_n, \qquad \varepsilon_n \overset{\mathrm{i.i.d.}}{\sim} N(0, \sigma^2)$$

(see Tsybakov, 2009). While previous authors have also considered this 
model under adaptive sensing, their results have either been restricted to 
quite specific classes of functions $f$, or have not provided improved rates of
convergence (Faraway, 1990; Cohn et al., 1996; Hall and Molchanov, 2003; 
Castro et al, 2006). 
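
The observation model above can be sketched in a few lines. This is a minimal illustration only: the function name `simulate_uniform_design`, the test function `doppler_like`, and the chosen noise level are assumptions made for the example, not objects defined in the paper.

```python
import math
import random

def simulate_uniform_design(f, n, sigma, seed=0):
    """Draw Y_i = f(x_i) + eps_i on the uniform design x_i = (i - 1)/n."""
    rng = random.Random(seed)
    xs = [i / n for i in range(n)]                    # (i - 1)/n for i = 1..n
    ys = [f(x) + rng.gauss(0.0, sigma) for x in xs]   # i.i.d. N(0, sigma^2) noise
    return xs, ys

def doppler_like(x):
    # Hypothetical spatially-inhomogeneous test function (illustrative only).
    return math.sqrt(x * (1 - x)) * math.sin(2.1 * math.pi / (x + 0.05))

xs, ys = simulate_uniform_design(doppler_like, n=8, sigma=0.1)
```

An adaptive-sensing procedure would replace the fixed list `xs` by points chosen after inspecting the earlier `ys`.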

For general functions $f$, it is said that adaptive sensing provides little
benefit. For example, suppose the function $f$ lies within the class $C^s(M, I)$
of functions which are $s$-Hölder, with norm at most $M$, on an interval $I$ open
in $[0,1]$. With a fixed design, we can estimate $f$ at any point in $I$, with error
decaying, up to log factors, like $n^{-\alpha(s)}$, where

$$\alpha(s) := s/(2s + 1).$$

We can even do so:

(i) adaptively, without prior knowledge of $s$, $M$ or $I$; and

(ii) uniformly, over closed intervals $J \subset I$.

Theorem 1. Let $s_{\max} > \tfrac12$. Using a uniform design $x_i = (i-1)/n$, there
exists an estimator $\hat f_n$, which satisfies

$$\sup_{x \in J} |\hat f_n(x) - f(x)| = O_p(c_n),$$

uniformly over $f \in C^s(M, I) \cap C^{1/2}(M)$, for any $s \in [\tfrac12, s_{\max}]$, $M > 0$, $I$ an
interval open in $[0,1]$, $J \subseteq I$ a closed interval, and

$$c_n = (n/\log(n))^{-\alpha(s)}.$$
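
To get a feel for the rate in Theorem 1, the exponent $\alpha(s)$ and the sequence $c_n$ can be evaluated numerically. The sketch below is illustrative; the function names are assumptions, and constants hidden in the $O_p$ notation are ignored.

```python
import math

def alpha(s):
    """Rate exponent alpha(s) = s / (2s + 1)."""
    return s / (2.0 * s + 1.0)

def fixed_design_rate(n, s):
    """c_n = (n / log n)^(-alpha(s)), ignoring constants."""
    return (n / math.log(n)) ** (-alpha(s))

for s in (0.5, 1.0, 2.0):
    print(s, alpha(s), fixed_design_rate(2 ** 15, s))
```

Smoother functions (larger $s$) give a larger exponent and hence a faster-decaying error bound.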

We can do no better using adaptive sensing. Castro et al. (2006) prove
such a result for global $L^2$ loss; we will show the same is true, up to log
factors, locally uniformly. Define an adaptive-sensing algorithm to be a
choice of design points $x_n = x_n(Y_1, \dots, Y_{n-1})$, together with an estimator
$\hat f_n = \hat f_n(Y_1, \dots, Y_n)$ of $f$.

Theorem 2. Let $s \ge \tfrac12$, $M > 0$, $I$ an interval open in $[0,1]$, and $J \subseteq I$ a
closed interval. Given an adaptive-sensing algorithm with estimator $\hat f_n$, if

$$\sup_{x \in J} |\hat f_n(x) - f(x)| = O_p(c_n)$$

uniformly over $f \in C^s(M, I) \cap C^{1/2}(M)$, then

Nevertheless, in this paper we will describe an adaptive-sensing
algorithm, applicable to general nonparametric regression problems, which
offers significantly improved convergence rates for many functions of interest.
We will argue that Hölder classes are not the best way to measure
smoothness in this problem, and that other smoothness classes may prove more
meaningful.

Our starting point will be the field of spatial adaptation, which considers 
the estimation of spatially-inhomogeneous functions. These functions $f(x)$
have smoothness varying with $x$; they may be rougher in some regions of
space, but smoother in others. The seminal paper of Donoho and Johnstone 
(1994) provides examples, which we have reproduced in Figure 1; these 
mimic the kinds of functions observed in imaging, spectroscopy and other signal
processing problems. Estimating such functions has been a topic of much 
interest in the literature (see also Donoho et al., 1995; Fan and Gijbels, 1995; 
Lepski et al., 1997; Donoho and Johnstone, 1998; Fan et al., 1999). 

A spatially-adaptive estimate $\hat f_n$ is one which attains optimal rates of
convergence over a variety of spatially-inhomogeneous functions. The
estimate in Theorem 1, for example, is spatially adaptive, obtaining optimal
rates over many local Hölder classes. Looking at Figure 1, we might expect
that adaptive sensing should be able to provide further improvements: if we
placed more design points in regions where $f$ is rough, we would expect our
estimates $\hat f_n$ to become more accurate overall.

While we could always do so in some heuristic fashion, we would prefer 
to construct an algorithm which has a theoretical justification. In the fol- 
lowing, we will describe new smoothness classes of spatially-inhomogeneous 
functions. We will then detail an adaptive-sensing algorithm which obtains 
improved convergence rates over these functions, while simultaneously never 
obtaining worse rates than a fixed design. 

Smoothness classes similar to our own have arisen in the study of adaptive
nonparametric inference (Picard and Tribouley, 2000; Giné and Nickl,
2010; Bull, 2011). As in those papers, we find that for complex nonparamet- 
ric problems, the standard smoothness classes may be insufficient to describe 
behaviour of interest; by specifying our target functions more carefully, we 
can achieve more powerful results. 

We might also compare this phenomenon to results in sparse regression, 
where good rates are often dependent on specific assumptions about the 
design matrix or unknown parameters (see Fan and Lv, 2008; van de Geer
and Bühlmann, 2009; Meinshausen and Bühlmann, 2010). As there, we
can use the nature of our assumptions to provide insight into the kinds of 
problems on which we can expect to perform well. 

We will test our algorithm by estimating the functions in Figure 1 under
noise. We will see that, by sensing adaptively, we can make significant
improvements to accuracy; we thus conclude that adaptive sensing can be
of value in nonparametric regression whenever the unknown function may
be spatially inhomogeneous.

Figure 1: Examples of spatially-inhomogeneous functions from Donoho and
Johnstone (1994). Each function is scaled to have $\mathrm{sd}(f) = 7$.

In Section 2, we describe our model of spatial inhomogeneity, and show 
that adaptive sensing can lead to improved performance over such func- 
tions. In Section 3, we describe our adaptive-sensing algorithm in detail; 
in Section 4, we discuss the implementation of our algorithm, and provide 
empirical results. Finally, we provide proofs in Appendix A. 



2 Main results 

To provide better rates of convergence over spatially-inhomogeneous func- 
tions, we will need to exploit two features apparent in the functions of Fig- 
ure 1. We begin with a discussion of these features, before showing that 
they allow us to obtain improved rates of convergence; we conclude with
technical details of our smoothness classes. 



2.1 Spatially-inhomogeneous functions 

The first feature we will need from the functions in Figure 1 is that they are 
sparse. While in some regions the functions are rough, in others the functions 
are very smooth. It is this difference between rough and smooth which allows 
us to improve performance, placing more design points in rougher regions. 

Describing smoothness in terms of Hölder spaces $C^r$ does not capture
this sparsity, as the Hölder smoothness of a function is determined only
by its roughest point. There are other function classes, however, which do
measure sparsity. For example, the Sobolev spaces $W^{r,p}$, where $p \in [1, \infty)$,
contain functions which are $r$-smooth on average, but may be rougher in
places; in particular, this includes the Sobolev Hilbert spaces $H^r = W^{r,2}$.
The spaces $BV$ of functions of bounded variation likewise contain functions
which are only on average differentiable.

We can describe all these classes in terms of the Besov spaces $B^r_{p,\infty}$,
which are commonly used to measure smoothness in the spatial adaptation
literature. The Hölder spaces $C^r \subseteq B^r_{\infty,\infty}$, the Sobolev spaces $W^{r,p} \subseteq B^r_{p,\infty}$,
and the space of functions of bounded variation $BV \subseteq B^1_{1,\infty}$. For statistical
purposes, we can achieve the same rates of estimation over these enlarged
classes; we may thus restrict our attention to Besov classes $B^r_{p,\infty}$ in what
follows. The parameter $r$ measures the average smoothness of functions
$f \in B^r_{p,\infty}$, and the parameter $p$ their sparsity; smaller values of $p$ correspond
to sparser functions.

Sparsity alone, however, does not allow us to benefit from adaptive
sensing. For $M > 0$, let $B^r_{p,\infty}(M)$ denote the functions in $B^r_{p,\infty}$ having
Besov norm at most $M$. Since $B^r_{p,\infty}(M) \subseteq C^{r-1/p}(M)$, we can achieve the rate
$n^{-\alpha(r-1/p)}$, up to log factors, with the fixed-design method of Theorem 1. We
can then show that, in this case, adaptive sensing offers little improvement.

Theorem 3. Let $p \in [1,\infty]$, $r \ge \tfrac12 + \tfrac1p$, $M > 0$, and $I$ be an interval in
$[0,1]$. Given an adaptive-sensing algorithm with estimator $\hat f_n$, if



uniformly over $f \in B^r_{p,\infty}(M)$, then



To obtain improved rates, we must use another feature of the functions
in Figure 1: when sampling from these functions, we can quite easily tell
which regions are rough, and which are smooth. This property is likewise
necessary to obtain improved performance: we can place more design points
in rough areas of $f$ only once we know where those areas are.

We will call such functions detectable, and give a formal definition of this
property in Section 2.3. For now, we will say that we consider classes $D^s_t(I)$
of functions whose irregularities are detectable over an interval $I \subseteq [0,1]$.
The parameter $s > 0$ governs the minimum smoothness we require over $I$,
while $t \in (0,1)$ controls how easy it is to detect irregularities. When $t \approx 0$,
it may be hard to detect irregularities; when $t \approx 1$ it is easy.

We will be interested in functions $f$ satisfying both these conditions. Fix
$s_{\max} > \tfrac12$, and for $p \in [1,\infty]$, $r \in [\tfrac12 + \tfrac1p, s_{\max}]$, $s \in [r - \tfrac1p, s_{\max}]$, $t \in (0,1)$,
$M > 0$, and any interval $I \subseteq [0,1]$, let

$$\mathcal{F} = \mathcal{F}(p, r, s, t, M, I) := B^r_{p,\infty}(M) \cap D^s_t(I) \tag{1}$$

denote a class of sparse and detectable functions.

We note that this class has two smoothness parameters: $r$ governs the
average global smoothness of a function $f \in \mathcal{F}$, while $s$ governs its local
smoothness over $I$. Since functions in $B^r_{p,\infty}$ are everywhere at least
$(r - \tfrac1p)$-smooth, we have restricted to the interesting case $s \ge r - \tfrac1p$; with this in
mind, we can show that functions $f \in \mathcal{F}$ are locally $s$-Hölder.

Proposition 4. $\mathcal{F} \subseteq C^s(M', I)$, for some $M' > 0$.

These conditions therefore combine to give us a characterisation of local
smoothness, not unlike a local Hölder class. Indeed, we can show that
given a fixed design, the worst-case error is the same; requiring sparsity and
detectability thus does not make estimation fundamentally easier.

Theorem 5. Using a fixed design, if an estimator $\hat f_n$ satisfies

$$\sup |\hat f_n(x) - f(x)| = O_p(c_n),$$

uniformly over $f \in \mathcal{F}$, then

$$c_n \gtrsim (n/\log(n))^{-\alpha(s)}.$$

2.2 Benefits of adaptive sensing

With adaptive sensing, however, we can take advantage of these conditions
to obtain improved rates of convergence. Indeed, we can do so even without
knowledge of the class $\mathcal{F}$; we can thus adapt not only to the regions where
$f$ is rough, but also to the overall smoothness and sparsity of $f$.

Theorem 6. There exists an adaptive-sensing algorithm, depending only on
$s_{\max}$, whose estimator $\hat f_n$ satisfies

$$\sup |\hat f_n(x) - f(x)| = O_p(c_n),$$

uniformly over $f \in \mathcal{F}$, for $u := \max(r - s, 0)$,

$$r' := s + tu, \qquad s' := s/(1 - ptu), \tag{2}$$

and 

$$c_n := (n/\log(n)^3)^{-\alpha(\min(r', s'))} \log(n)^{1(r' = s')}. \tag{3}$$

We thus obtain, up to log factors, the weaker of the two rates $n^{-\alpha(r')}$ and
$n^{-\alpha(s')}$. Both of these rates are at least as good as the $n^{-\alpha(s)}$ bound faced
by a fixed design; when $s < r$, and the function $f$ may be locally rough,
the rates are strictly better. In that case, we obtain the $n^{-\alpha(r')}$ rate when
$s/r > (1 - t)/(2 - t)$, and the $n^{-\alpha(s')}$ rate otherwise.
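
The effective smoothness parameters in (2) are simple to evaluate. The following is a minimal sketch, assuming the formulas $u = \max(r - s, 0)$, $r' = s + tu$ and $s' = s/(1 - ptu)$ as printed above; the function names and the parameter values in the usage lines are illustrative assumptions.

```python
def alpha(s):
    """Rate exponent alpha(s) = s / (2s + 1)."""
    return s / (2.0 * s + 1.0)

def adaptive_exponents(p, r, s, t):
    """Return (r', s') from (2), for s >= r - 1/p and 0 < t < 1."""
    if s < r - 1.0 / p - 1e-12:
        raise ValueError("the class requires s >= r - 1/p")
    u = max(r - s, 0.0)
    if u == 0.0:
        return s, s          # no local roughness: both exponents equal s
    return s + t * u, s / (1.0 - p * t * u)

def improved_exponent(p, r, s, t):
    """Exponent alpha(min(r', s')) appearing in the adaptive rate."""
    return alpha(min(adaptive_exponents(p, r, s, t)))

# Illustrative values: p = u = 1 and t close to 1.
print(adaptive_exponents(1, 2.0, 1.0, 0.9), improved_exponent(1, 2.0, 1.0, 0.9), alpha(1.0))
```

Because $u \le 1/p$ and $t < 1$ on the class, the denominator $1 - ptu$ stays positive, which is why the sketch rejects inputs with $s < r - 1/p$.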

The improvement is driven by two parameters: $t$, which governs how easy
it is to detect irregularities of $f$, and $u$, which governs how much rougher $f$
is locally than on average. When both $t$ and $u$ are large, the rates we obtain
are significantly improved; in the most favourable case, when $p = u = 1$,
and $t \approx 1$, this result is equivalent to gaining an extra derivative of $f$. We
can even show that these rates are near-optimal over classes $\mathcal{F}$.

Theorem 7. Given an adaptive-sensing algorithm with estimator $\hat f_n$, if

$$\sup |\hat f_n(x) - f(x)| = O_p(c_n),$$

uniformly over $f \in \mathcal{F}$, then

$$c_n \gtrsim n^{-\alpha(\min(r', s'))},$$

for $r'$, $s'$ given by (2).

Furthermore, we also have that, even in the absence of sparsity or
detectability, we still achieve the spatial adaptation properties of a fixed design.
We may thus use our adaptive-sensing algorithm with the confidence that,
even if $f$ is spatially homogeneous, we will not pay an asymptotic penalty.

Theorem 8. There exists an adaptive-sensing algorithm, satisfying the
conditions of Theorem 6, which also satisfies

$$\sup_{x \in J} |\hat f_n(x) - f(x)| = O_p(c_n),$$

uniformly over $f \in C^s(M, I) \cap C^{1/2}(M)$, for any $s \in [\tfrac12, s_{\max}]$, $M > 0$, $I$ an
interval open in $[0,1]$, $J \subseteq I$ a closed interval, and

$$c_n := (n/\log(n))^{-\alpha(s)}.$$

2.3 Definitions of smoothness

We now give a precise definition of our smoothness classes, using wavelet
techniques. For $j_0 \in \mathbb{N}$, let

$$\varphi_{j_0,k} \text{ and } \psi_{j,k}, \qquad j \in \{j_0, j_0 + 1, \dots\},\ k \in \{0, \dots, 2^j - 1\},$$

be a compactly-supported wavelet basis of $L^2([0,1])$, such as the construction
of Cohen et al. (1993). We will assume the wavelets $\psi_{j,k}$ have $N > s_{\max}$
vanishing moments,

$$\int x^n \psi_{j,k}(x)\,dx = 0, \qquad n = 0, \dots, N - 1,$$

and both $\varphi_{j,k}$ and $\psi_{j,k}$ are zero outside intervals $S_{j,k}$ of width $2^{-j}(2L - 1)$,

$$S_{j,k} := 2^{-j}[k - L + 1, k + L) \cap [0,1).$$

We can then describe any function $f \in L^2([0,1])$ by its wavelet series,

$$f = \sum_{k=0}^{2^{j_0}-1} \alpha_{j_0,k}\varphi_{j_0,k} + \sum_{j=j_0}^{\infty} \sum_{k=0}^{2^j-1} \beta_{j,k}\psi_{j,k}.$$

The smoothness of $f$ is determined by the size of the coefficients $\alpha_{j_0,k}$, $\beta_{j,k}$;
$f$ is smooth when the coefficients are small. We can use this property to
define the Hölder classes $C^s(M)$, local Hölder classes $C^s(M, I)$, and Besov
classes $B^r_{p,\infty}(M)$ (Härdle et al., 1998).

Definition 9. For $s \in (0, N)$, $M > 0$, and $I \subseteq [0,1]$, $C^s(M, I)$ is the class
of functions $f \in L^2([0,1])$ satisfying

$$\max\Bigl( 2^{j_0(s+\frac12)} \sup_{k : S_{j_0,k} \subseteq I} |\alpha_{j_0,k}|,\ \sup_{j \ge j_0} 2^{j(s+\frac12)} \sup_{k : S_{j,k} \subseteq I} |\beta_{j,k}| \Bigr) \le M.$$

For $I = [0,1]$, we denote this class $C^s(M)$.

Definition 10. For $r \in (0, N)$, $p \in [1,\infty)$, and $M > 0$, $B^r_{p,\infty}(M)$ is the
class of functions $f \in L^2([0,1])$ satisfying

$$\max\Bigl( 2^{j_0(r+\frac12-\frac1p)} \bigl(\textstyle\sum_k |\alpha_{j_0,k}|^p\bigr)^{1/p},\ \sup_{j \ge j_0} 2^{j(r+\frac12-\frac1p)} \bigl(\textstyle\sum_k |\beta_{j,k}|^p\bigr)^{1/p} \Bigr) \le M.$$

For $p = \infty$, we define $B^r_{\infty,\infty}(M) := C^r(M)$.

Figure 2: Wavelet coefficients for the functions in Figure 1. The height
of each line corresponds to a wavelet coefficient $\beta_{j,k}$; the x-axis plots the
location and the y-axis the scale $j$.

We note that for $C^s(M)$, $s \notin \mathbb{N}$, and $B^r_{p,\infty}(M)$, these definitions are
equivalent to the classical ones. For $C^s(M)$, $s \in \mathbb{N}$, and $C^s(M, I)$, the above
conditions are slightly weaker; they include functions which are classically
Hölder for any smaller $s$ or $I$. We may thus take the above as our definitions
of the Hölder and locally Hölder classes in this paper.

Our detectability condition is likewise stated in terms of the wavelet
coefficients of $f$. Figure 2 plots the wavelet coefficients of the functions in
Figure 1, using CDV-8 wavelets. We see that in regions where $f$ is rough,
the coefficients are often large; in regions where $f$ is smooth, the coefficients
are small.

We use this intuition to motivate our definition of detectability. We will
say a function's irregularities are detectable if, when wavelet coefficients in
a region are large at fine scales, they are also large at coarse scales. This
property will allow us to identify the regions in which $f$ is rough.



Definition 11. For $s \in (0, N)$, $t \in (0,1)$, and an interval $I \subseteq [0,1]$, $D^s_t(I)$
is the class of functions $f \in L^2([0,1])$ satisfying

$$\forall\, j \in \{\lceil j_0/t \rceil, \lceil j_0/t \rceil + 1, \dots\},\ k : S_{j,k} \cap I \ne \emptyset,$$
$$\exists\, j' \in \{\lfloor tj \rfloor, \dots, j - 1\},\ k' : S_{j',k'} \supseteq S_{j,k},$$
$$|\beta_{j',k'}| \ge 2^{(j-j')(s+\frac12)} |\beta_{j,k}|. \tag{4}$$

The definition thus requires that each term in the wavelet series on $I$, at
a fine scale $j$, lies within the support of another term, of comparable size,
at a coarser scale $j'$. The parameter $s$ controls how large this second term
must be, and $t$ controls how far apart the scales $j$ and $j'$ can be. We note
that Proposition 4 then follows from the definitions.

3 The adaptive-sensing algorithm

We now describe our adaptive-sensing algorithm in detail. We first discuss
how we estimate under varying designs; we then move on to the choice of
design itself.

3.1 Estimation under varying designs

Given design points $\xi_n := \{x_1, \dots, x_n\}$, we will estimate the function $f$ using
the technique of wavelet thresholding, which is known to give
spatially-adaptive estimates (Donoho and Johnstone, 1994). For any
$i \in \{j_0, j_0 + 1, \dots\}$, we may write an unknown function $f \in L^2([0,1])$ in terms of its
wavelet expansion,

$$f = \sum_{k=0}^{2^i-1} \alpha_{i,k}\varphi_{i,k} + \sum_{j=i}^{\infty} \sum_{k=0}^{2^j-1} \beta_{j,k}\psi_{j,k},$$

and estimate $f$ in terms of the coefficients $\alpha_{j_0,k}$, $\beta_{j,k}$.

When the design is uniform, we can estimate these coefficients in a
computationally-efficient way using the fast wavelet transform. Suppose
(as will always be the case in the following) that the design points $x_n$
are distinct, so we may denote the observations $Y_n$ as $Y(x_n)$. Given
$i \in \{j_0, j_0 + 1, \dots\}$, suppose also that we have observed $f$ on a grid of design
points $k2^{-i}$, for $k \in \{0, \dots, 2^i - 1\}$.

We may then estimate the scaling coefficients $\alpha_{i,k}$ of $f$ as

$$\hat\alpha^i_{i,k} := 2^{-i/2}\, Y(2^{-i}k),$$

since for $i$ large,

$$\alpha_{i,k} = \int_{S_{i,k}} f(x)\varphi_{i,k}(x)\,dx \approx 2^{-i/2} f(2^{-i}k). \tag{5}$$

By an orthogonal change of basis, we can produce estimates $\hat\alpha^i_{j_0,k}$ and $\hat\beta^i_{j,k}$
of the coefficients $\alpha_{j_0,k}$ and $\beta_{j,k}$, given by the relationship

$$\sum_{k=0}^{2^{j_0}-1} \hat\alpha^i_{j_0,k}\varphi_{j_0,k} + \sum_{j=j_0}^{i-1} \sum_{k=0}^{2^j-1} \hat\beta^i_{j,k}\psi_{j,k} := \sum_{k=0}^{2^i-1} \hat\alpha^i_{i,k}\varphi_{i,k}. \tag{6}$$

These estimates can be computed efficiently by applying the fast wavelet
transform to the vector of values $(2^{-i/2}\, Y(2^{-i}k))_k$.
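
The change of basis in (6) can be illustrated with the simplest orthogonal wavelet. The sketch below swaps in the Haar basis for the boundary-corrected wavelets of Cohen et al. (1993) used in the paper, purely to keep the example short; the function names are assumptions, and the input length must be a power of two.

```python
import math

def scaling_estimates(ys, i):
    """Level-i scaling estimates 2^(-i/2) * Y(2^-i k) from 2^i grid observations."""
    assert len(ys) == 2 ** i
    return [2 ** (-i / 2) * y for y in ys]

def haar_fwt(values, j0):
    """Orthogonal Haar transform from level i = log2(len(values)) down to level j0.

    Returns (alpha, betas): alpha holds the 2^j0 scaling coefficients and
    betas[j] the 2^j detail coefficients at each level j in j0..i-1.
    """
    a = list(values)
    betas = {}
    root2 = math.sqrt(2.0)
    while len(a) > 2 ** j0:
        j = (len(a) // 2).bit_length() - 1       # level of the coefficients produced
        betas[j] = [(a[2 * k] - a[2 * k + 1]) / root2 for k in range(len(a) // 2)]
        a = [(a[2 * k] + a[2 * k + 1]) / root2 for k in range(len(a) // 2)]
    return a, betas

alpha0, betas = haar_fwt(scaling_estimates([1.0] * 8, 3), j0=0)
```

For a constant signal all detail coefficients vanish, and because the transform is orthogonal the sum of squared coefficients is preserved.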

Since we will be considering non-uniform designs, this situation will often
not apply directly. Many approaches to applying wavelets to non-uniform 
designs have been considered in the literature, including transformations 
of the data, and design-adapted wavelets (see Kerkyacharian and Picard, 
2004, and references therein). In the following, however, we will use a sim- 
ple method, which allows us to simultaneously control the accuracy of our 
estimates for many different choices of design. 

To proceed, we note that the value of an estimated coefficient $\hat\alpha^i_{j_0,k}$ or
$\hat\beta^i_{j,k}$ depends only on those $\hat\alpha^i_{i,l}$ for which $2^{-i}l \in S_{j,k}$; equivalently, it depends
only on observations $Y(x)$ at points $x \in S_{j,k} \cap 2^{-i}\mathbb{Z}$. Choose
$j_n^{\max} \in \{j_0 + 1, j_0 + 2, \dots\}$ so that $2^{j_n^{\max}} \sim n/\log(n)$, and for $j \in \{j_0, \dots, j_n^{\max} - 1\}$,
$k \in \{0, \dots, 2^j - 1\}$, let $i_n(j,k)$ be the largest index $i$ for which we can
compute $\hat\alpha^i_{j_0,k}$ or $\hat\beta^i_{j,k}$ at time $n$,

$$i_n(j,k) := \max\{ i \in \{j + 1, j + 2, \dots\} : S_{j,k} \cap 2^{-i}\mathbb{Z} \subseteq \xi_n \}. \tag{7}$$

(We will select the design points so that this set is always non-empty.) We
then estimate the wavelet coefficients $\alpha_{j_0,k}$ and $\beta_{j,k}$ by

$$\hat\alpha_{j_0,k} := \hat\alpha^{i_n(j_0,k)}_{j_0,k} \quad \text{and} \quad \hat\beta_{j,k} := \hat\beta^{i_n(j,k)}_{j,k}. \tag{8}$$
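
The index $i_n(j,k)$ in (7) can be computed by walking up the dyadic grids until one is no longer contained in the design. The following is a minimal sketch using exact rational arithmetic; the function names, the default support parameter `L=1` and the cap `max_level` are assumptions for illustration.

```python
from fractions import Fraction

def support(j, k, L=1):
    """S_{j,k} = 2^-j [k - L + 1, k + L) intersected with [0, 1), as (lo, hi)."""
    lo = max(Fraction(k - L + 1, 2 ** j), Fraction(0))
    hi = min(Fraction(k + L, 2 ** j), Fraction(1))
    return lo, hi

def finest_level(design, j, k, L=1, max_level=30):
    """Largest i > j with (S_{j,k} intersect 2^-i Z) inside the design, or None."""
    lo, hi = support(j, k, L)
    best = None
    for i in range(j + 1, max_level + 1):
        # lo and hi are multiples of 2^-j, so these products are integers
        grid = (Fraction(m, 2 ** i) for m in range(int(lo * 2 ** i), int(hi * 2 ** i)))
        if all(x in design for x in grid):
            best = i
        else:
            break        # the grids are nested, so finer levels fail as well
    return best

design = {Fraction(m, 16) for m in range(16)}   # uniform design with 16 points
print(finest_level(design, j=2, k=0))
```

The early exit relies on $2^{-i}\mathbb{Z} \subseteq 2^{-(i+1)}\mathbb{Z}$: once one grid is missing a point, every finer grid is missing it too.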

Using these estimates directly will lead to a consistent estimate of $f$, but
one converging very slowly; to obtain a spatially-adaptive estimate, we must
use thresholding. Fix $\kappa > 1$, and for

$$e_n(j,k) := \sigma\, 2^{-\frac12 i_n(j,k)} \sqrt{2\log(n)} \tag{9}$$

define the hard-threshold estimates

$$\hat\beta^T_{j,k} := \begin{cases} \hat\beta_{j,k}, & |\hat\beta_{j,k}| \ge \kappa e_n(j,k), \\ 0, & \text{otherwise.} \end{cases}$$

We then estimate $f$ by

$$\hat f_n := \sum_{k=0}^{2^{j_0}-1} \hat\alpha_{j_0,k}\varphi_{j_0,k} + \sum_{j=j_0}^{j_n^{\max}-1} \sum_{k=0}^{2^j-1} \hat\beta^T_{j,k}\psi_{j,k}. \tag{10}$$

Given a uniform design $\xi_n = 2^{-i}\mathbb{Z} \cap [0,1)$, this is a standard hard-threshold
estimate; otherwise it gives a generalisation to non-uniform designs.
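
The hard-thresholding rule is a one-line operation per coefficient. The sketch below takes the noise levels $e_n(j,k)$ as given inputs rather than computing them; the function names and the dictionary layout keyed by `(j, k)` are assumptions for illustration.

```python
def hard_threshold(beta_hat, e, kappa=1.0):
    """Keep beta_hat when |beta_hat| >= kappa * e, otherwise set it to zero."""
    return beta_hat if abs(beta_hat) >= kappa * e else 0.0

def threshold_all(coeffs, noise_levels, kappa=1.0):
    """Apply the rule to a dict {(j, k): beta_hat} with per-coefficient levels."""
    return {jk: hard_threshold(b, noise_levels[jk], kappa) for jk, b in coeffs.items()}

coeffs = {(3, 0): 0.9, (3, 1): -0.05, (4, 2): 0.2}
levels = {(3, 0): 0.1, (3, 1): 0.1, (4, 2): 0.3}
print(threshold_all(coeffs, levels, kappa=1.0))
```

Because the threshold is allowed to differ between coefficients, the same routine covers both uniform and non-uniform designs.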



3.2 Adaptive design choices 

So far, we have only discussed how to estimate $f$ from a fixed design.
However, we can also use these estimates to choose the design points adaptively.
Suppose we choose the design points in stages, at stage $m$ selecting points
$x_{n_{m-1}+1}$ to $x_{n_m}$ in terms of previous observations $Y_1, \dots, Y_{n_{m-1}}$. The
number of design points in each stage can be chosen freely, subject only to the
conditions that $n_0$ is a power of two, and the ratios $n_m/n_{m-1}$ are bounded
away from 1 and $\infty$. We may, for example, choose

$$n_m := \lfloor 2^{j + m\tau} \rfloor, \qquad m \in \{0, 1, \dots\}, \tag{11}$$

for some $j \in \mathbb{N}$, and $\tau > 0$.
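
A geometric schedule of stage sizes of this kind is easy to tabulate. The sketch below assumes the schedule has the form $n_m = \lfloor 2^{j + m\tau} \rfloor$; the function name and the values of `j` and `tau` in the usage line are illustrative and not taken from the paper.

```python
import math

def stage_sizes(j, tau, num_stages):
    """Cumulative sample sizes n_0, n_1, ... with n_m = floor(2^(j + m * tau))."""
    return [math.floor(2 ** (j + m * tau)) for m in range(num_stages)]

print(stage_sizes(j=6, tau=0.5, num_stages=5))
```

With this form $n_0 = 2^j$ is a power of two, and successive ratios are close to $2^\tau$, so they stay bounded away from 1 and $\infty$.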

In the initial stage, we choose $n_0$ design points spaced uniformly on $[0,1]$,

$$x_i := (i - 1)/n_0, \qquad i \in \{1, \dots, n_0\}.$$

At further stages $m \in \mathbb{N}$, we will select design points $x_{n_{m-1}+1}, \dots, x_{n_m}$ so
that the design $\xi_{n_m}$ approaches a draw from a target density $p_m$ on $[0,1]$. We
will choose $p_m$ in terms of the previous observations $Y_1, \dots, Y_{n_{m-1}}$, focusing
our attention on regions of $[0,1]$ where we believe the function $f$ is difficult
to estimate.

At time $n_{m-1}$, for each $j \in \{j_0, \dots, j^{\max}_{n_{m-1}} - 1\}$, rank the $2^j$
hard-thresholded empirical wavelet coefficients $\hat\beta^T_{j,k}$ in decreasing order of size.
We then have

$$|\hat\beta^T_{j, r_j^{-1}(1)}| \ge \dots \ge |\hat\beta^T_{j, r_j^{-1}(2^j)}|$$

for a bijective ranking function $r_j$. Split the interval $[0,1]$ into sub-intervals

$$I_{l,m} := 2^{-j^{\max}_{n_m}}[l, l + 1), \qquad l \in \{0, \dots, 2^{j^{\max}_{n_m}} - 1\},$$

and fix $\lambda > 0$. We define the target density on $I_{l,m}$ to be



Pl tTn := imax ( {A}U 
f 23 



■ 3 G Oo, • • ■ ~ !>, hm C S hk , Pi k * 
where the fixed constant $A > 0$ is chosen so that the density $p_m$ always
integrates to at most one, $2^{-j^{\max}_{n_m}} \sum_l p_{l,m} \le 1$. The specific value of $A$ is
unimportant, but note that

■ max --max 
2 J ™m -1 2^ 

;=o j=0 k=l 

so such a choice of A exists. 

To choose the new design points $x_{n_{m-1}+1}, \dots, x_{n_m}$, we first include any
points $x \in 2^{-j^{\max}_{n_m}}\mathbb{Z} \cap [0,1)$ not already in the design. (We will assume the
$n_m$ and $j^{\max}_{n_m}$ are chosen so that this requires no more than $n_m - n_{m-1}$ design
points; since $j^{\max}_n$ is defined only asymptotically, and $2^{j^{\max}_{n_m}} = o(n_m - n_{m-1})$,
such a choice is always possible.) We then define the effective density $q_{m,n}$,
describing the nominal density of a distribution generating the design $\xi_n$.
Define the effective density on $I_{l,m}$ at time $n$ to be

$$q_{l,m,n} := n^{-1} \max\{ 2^i : i \in \mathbb{N},\ 2^{-i}\mathbb{Z} \cap I_{l,m} \subseteq \xi_n \}.$$

Again, note this density integrates to at most one, $2^{-j^{\max}_{n_m}} \sum_l q_{l,m,n} \le 1$.

Our remaining goal is to choose the new design points to minimise the
maximum discrepancy from $p_m$ to $q_{m,n}$,

$$\max_{l=0}^{2^{j^{\max}_{n_m}}-1} p_{l,m}/q_{l,m,n}. \tag{12}$$

Having selected points $x_1, \dots, x_n$, we pick an $l$ maximising (12); note that
doing so does not require us to calculate $A$. We then add points $2^{-i}\mathbb{Z} \cap I_{l,m}$
to the design, choosing the smallest index $i$ for which at least one such point
is not already present. We repeat this process until we have selected a total
of $n_m$ design points; for convenience, let $q_{l,m} := q_{l,m,n_m}$ denote the effective
density on $I_{l,m}$ once we are done.
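
The greedy selection step can be sketched with a binary heap. This is a simplified illustration, not the paper's implementation: the target weights `p` are taken as given inputs, each block is assumed to start with exactly its coarsest grid point, and the common factor $n^{-1}$ in the effective density is dropped since it does not affect the maximiser. All names are hypothetical.

```python
import heapq
from fractions import Fraction

def greedy_refine(p, J, budget):
    """Add `budget` points, always refining the block with the largest p_l / 2^level_l.

    p[l] is the target weight on block I_l = 2^-J [l, l + 1); len(p) == 2^J.
    Returns the new points (in the order chosen) and the final dyadic level per block.
    """
    assert len(p) == 2 ** J
    level = [J] * len(p)
    heap = [(-p_l / 2 ** J, l) for l, p_l in enumerate(p)]   # negated keys: max-heap
    heapq.heapify(heap)
    new_points = []
    while len(new_points) < budget:
        _, l = heapq.heappop(heap)
        i = level[l] + 1
        start, stop = l * 2 ** (i - J), (l + 1) * 2 ** (i - J)
        # points of 2^-i Z in I_l missing from the coarser grid: odd multiples of 2^-i
        batch = [Fraction(m, 2 ** i) for m in range(start, stop) if m % 2 == 1]
        room = budget - len(new_points)
        new_points.extend(batch[:room])
        if len(batch) <= room:                  # block is now complete at level i
            level[l] = i
            heapq.heappush(heap, (-p[l] / 2 ** i, l))
    return new_points, level

points, levels = greedy_refine([1.0, 1.0, 8.0, 1.0], J=2, budget=3)
print(points, levels)
```

Each refinement doubles the effective density of the chosen block, so its ratio halves before it re-enters the heap; blocks with large target weight are therefore refined repeatedly until their ratio falls back in line with the others.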

The final algorithm is thus described by Algorithm 1; it can be
implemented efficiently using a priority queue to find values of $l$ maximising (12).
We will show in Section A.2 that this algorithm satisfies the conditions of
Theorems 6 and 8.

Algorithm 1 Spatially-adaptive sensing
  $n \leftarrow n_0$
  $x_1, \dots, x_n \leftarrow n^{-1}\mathbb{Z} \cap [0,1)$
  observe $Y_1, \dots, Y_n$
  $m \leftarrow 1$
  loop
    $x_{n+1}, \dots, x_{n'} \leftarrow 2^{-j^{\max}_{n_m}}\mathbb{Z} \cap [0,1) \setminus \xi_n$
    $n \leftarrow n'$
    while $n < n_m$ do
      choose $l$ maximising $p_{l,m}/q_{l,m,n}$
      $S \leftarrow 2^{-i}\mathbb{Z} \cap I_{l,m} \setminus \xi_n$, for the smallest $i$ such that $S \ne \emptyset$
      repeat
        $n \leftarrow n + 1$
        choose $x_n \in S$
        $S \leftarrow S \setminus \{x_n\}$
      until $S = \emptyset$ or $n = n_m$
    end while
    observe $Y_{n_{m-1}+1}, \dots, Y_n$
    estimate $f$ by $\hat f_n$
    $m \leftarrow m + 1$
  end loop

4 Implementation and experiments

We now give some implementation details of Algorithm 1, and provide
empirical results. Before we test the algorithm, we must describe how we compute
$\hat f_n$, and choose the parameters governing the algorithm's behaviour.

4.1 Estimating functions

For simplicity, in (10) we defined $\hat f_n$ in terms of wavelets only up to the
resolution level $j_n^{\max}$. While asymptotically this carries no penalty, in finite
time we may do better by estimating all the wavelets for which we have
available data. In other words, we use the estimate

$$\hat f_n := \sum_{k=0}^{2^{j_0}-1} \hat\alpha_{j_0,k}\varphi_{j_0,k} + \sum_{j=j_0}^{\infty} \sum_{k=0}^{2^j-1} \hat\beta^T_{j,k}\psi_{j,k}, \tag{13}$$

where for $j \ge j_n^{\max}$, $k \in \{0, \dots, 2^j - 1\}$, if the set in (7) is empty, we let
$i_n(j,k) := -\infty$, forcing $\hat\beta^T_{j,k} = 0$. We note that since there are finitely many
design points, the sum in (13) must have finitely many non-zero terms; define

$$J_n := 1 + \sup\{ j \in \mathbb{N} : \exists\, k,\ i_n(j,k) > -\infty \}$$

to be the resolution level at which this sum terminates.

To compute these estimates $\hat f_n$, we must convert the estimated
coefficients $\hat\alpha_{j_0,k}$, $\hat\beta^T_{j,k}$ back into function values $\hat f_n(x)$. For $i \in \{J_n, J_n + 1, \dots\}$,
to evaluate $\hat f_n$ at points $x = 2^{-i}k$, $k \in \{0, \dots, 2^i - 1\}$, we make the
approximation

$$\hat f_n(2^{-i}k) \approx 2^{i/2}\, \hat\alpha^T_{i,k}, \tag{14}$$

where the post-thresholding scaling coefficients $\hat\alpha^T_{i,k}$ are defined by

$$\sum_{k=0}^{2^i-1} \hat\alpha^T_{i,k}\varphi_{i,k} := \hat f_n.$$
These can again be computed efficiently using the fast wavelet transform.

Given a uniform design, and predicting $f$ only at the design points, this
is enough to give estimates $\hat f_n$; if we set $\kappa = 1$, we have just described a
standard hard-threshold wavelet estimate (Donoho and Johnstone, 1994).
In that case, the observations and predictions are always made at the same
scale, $i_n(j,k) = i$, so the errors in (5) and (14) tend to cancel out. In other
cases, however, the observations and predictions may be at different scales;
these errors then may build up, making $\hat f_n$ look like a translation of $f$.

To resolve the issue, we will use a slightly different definition of the
estimated coefficients $\hat\alpha_{j_0,k}$ and $\hat\beta_{j,k}$, which ensures the scales of observation
and prediction are the same. Given $i \in \{J_n, J_n + 1, \dots\}$, to estimate $f$ at
points $x = 2^{-i}k$, $k \in \{0, \dots, 2^i - 1\}$, we set $x_{n,k} := \sup\{x \in \xi_n : x \le 2^{-i}k\}$,
and let

$$\hat\alpha_{i,k} := 2^{-i/2}\, Y(x_{n,k}).$$

We then define the estimates $\hat\alpha_{j_0,k}$ and $\hat\beta_{j,k}$ by

$$\sum_{k=0}^{2^{j_0}-1} \hat\alpha_{j_0,k}\varphi_{j_0,k} + \sum_{j=j_0}^{i-1} \sum_{k=0}^{2^j-1} \hat\beta_{j,k}\psi_{j,k} := \sum_{k=0}^{2^i-1} \hat\alpha_{i,k}\varphi_{i,k},$$

using the fast wavelet transform as before.

We note that this definition is approximately the same as the one in (8);
while it is harder to control theoretically, it gives improved experimental
behaviour. We also note that, with a uniform design, if we wish to predict $f$
only at the design points, this again reduces to a standard wavelet estimate.

4.2 Choosing parameters 

To apply Algorithm 1, we must choose the parameters $\kappa$, $\lambda$ and $\tau$, and also
estimate $\sigma$ if it is not already known. The parameter $\kappa$ governs the size of
our wavelet thresholds: larger $\kappa$ means we will be more conservative. While
our theoretical results are proved for choices $\kappa > 1$, in our empirical tests we
took $\kappa = 1$. This gives a simple choice of threshold which performs well, and
allows us to compare our results with standard hard-threshold estimates.

The parameter $\lambda$ controls how uniform we make our design points: for
$\lambda \gg 0$ the design points will be mostly uniform, while for $\lambda \approx 0$ they will be
concentrated at irregularities of $f$. The parameter $\tau$ likewise controls how
many design points we choose at each stage: for $\tau \gg 0$ there will be a few
large stages, while for $\tau \approx 0$ there will be many small ones. Empirically, we
found the values A = r = ^ gave good trade-offs. 

Finally, for uniform designs, Donoho and Johnstone (1994) suggest
estimating $\sigma$ by the median size of the $\hat\beta_{j,k}$ at fine resolution scales. Our designs
may not be uniform, but they are guaranteed to provide us with estimates
$\hat\beta_{j,k}$ up to level $j_n^{\max} - 1$. We will therefore use the similar estimate

$$\hat\sigma_n := \operatorname{median}\{ 2^{\frac12 i_n(j,k)} |\hat\beta_{j,k}| : j \ge j_n^{\max} - 1,\ i_n(j,k) > -\infty \}/0.6745,$$

which includes all estimated coefficients at scales at least this fine.
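
A median-based noise estimate of this kind takes only a few lines. The sketch below assumes each fine-scale coefficient is supplied together with its observation level $i_n(j,k)$, and rescales by $2^{i_n(j,k)/2}$ before taking the median; the function name and input layout are assumptions for illustration.

```python
import statistics

def estimate_sigma(fine_coeffs):
    """Median of 2^(i/2) * |beta_hat| over (beta_hat, i) pairs, divided by 0.6745."""
    scaled = [2 ** (i / 2) * abs(b) for b, i in fine_coeffs]
    return statistics.median(scaled) / 0.6745

# Coefficients whose rescaled magnitude is 0.6745 correspond to sigma = 1.
pairs = [(0.6745 * 2 ** (-i / 2), i) for i in (4, 5, 6)]
print(estimate_sigma(pairs))
```

The constant 0.6745 is approximately the upper quartile of the standard normal distribution, which makes the median absolute value a consistent estimate of the standard deviation for Gaussian noise.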
4.3 Empirical results 

We now describe the results of using Algorithm 1 to estimate the functions in
Figure 1, observing under $N(0,1)$ noise. To measure the spatial adaptivity
of our estimates, we evaluate procedures in terms of their maximum error
over $[0,1]$, approximated by

$$\max_{x \in 2^{-j}\mathbb{Z} \cap [0,1)} |\hat f_n(x) - f(x)|$$

for $j$ large. In our tests we considered $n$ up to $2^{15}$, so took $j = 17$, to avoid
biasing the performance measure towards a uniform design.

We then compared the performance of a fixed-design hard-threshold
estimate, letting $n_0 = n$, with that of our adaptive-sensing algorithm, letting
$n_0 = 2^6$, and $n_m$ be given by (11). The parameters $\kappa$, $\lambda$, $\tau$, and $\hat\sigma_n$ were
chosen as in Section 4.2. We used the family of wavelet bases described by
Cohen et al. (1993), and implemented in Nason (2010); we took wavelets with
$N = 8$ vanishing moments, set $j_0 = 5$, and $j_n^{\max} = \max(j_0 + 1, \lfloor \log_2(n/\log(n)) \rfloor)$.

In our fast wavelet transforms, we also applied the preconditioning step
of Cohen et al. (1993), which reduces boundary effects by applying a linear
transformation to the first and last $N$ data points. It can be checked that
computing $\hat\alpha^i_{j_0,k}$ or $\hat\beta^i_{j,k}$ still depends only on observations $Y(x)$ at points
$x \in S_{j,k}$, so this change need not affect our choice of design points.

Figure 3 compares the performance of these procedures on the Doppler
function; the values plotted are sample medians after 250 runs, together
with 95% confidence intervals for the true median. We can see that for $n$
large, adaptive design significantly outperforms the uniform one, and the
improvement increases with $n$.

Table 1 compares performance on all the functions in Figure 1, given
$n = 2^{15}$ observations. We again report sample medians after 250 runs,
together with the p-value of a two-sided Mann-Whitney U test for difference
in medians. (We note that the large errors reported for the Blocks function
are due to the large discontinuities present, which are difficult to estimate
uniformly over $[0,1]$.)

We can see that, for the Heavisine function, there is no significant
difference in performance; looking at Figure 1, we notice that this function
is fairly spatially homogeneous, perhaps suggesting why adaptive sensing
is of little benefit. For the other three functions, however, there is a
significant improvement. We thus conclude that adaptive sensing can be of
value in nonparametric regression whenever the function $f$ may be spatially
inhomogeneous.
Figure 3: Log-log plot of empirical performance on the Doppler function.





            uniform design   adaptive design   p-value
Blocks          13.815           11.417        < 0.001
Bumps            3.866            3.572        < 0.001
Heavisine        2.688            2.768          0.747
Doppler          1.605            1.338          0.016

Table 1: Empirical performance on the functions in Figure 1 for $n = 2^{15}$.
A Proofs

We now provide proofs of our results. We consider separately the negative 
results, which establish lower bounds, and the constructive results, which 
control the performance of Algorithm 1. 

A.1 Negative results

We now prove our first result, that under a uniform design, restricting to 
sparse detectable functions does not alter the minimax rate of estimation. 
We will require Fano's lemma, which relates the probability of misclassifying 
a signal to the Kullback-Leibler divergence between the alternatives (see 
Tsybakov, 2009, §2.7.1). Given probability measures $P$ and $Q$ on $\mathbb{R}^d$, having
densities $p$ and $q$ respectively, define the Kullback-Leibler divergence from
$P$ to $Q$,

$$D(P \| Q) := \int p(x) \log \frac{p(x)}{q(x)} \, dx.$$
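As a sanity check on this definition, the sketch below compares a numerical approximation of the integral with the closed form $(a - b)^2 / (2\sigma^2)$ for two Gaussians $N(a, \sigma^2)$ and $N(b, \sigma^2)$, which is the expression used for Gaussian observations in the proofs that follow. The function names, the midpoint rule, and the truncation of the integral to twelve standard deviations are choices made for illustration only.

```python
import math


def normal_pdf(x, mean, sigma):
    """Density of N(mean, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def kl_numeric(a, b, sigma, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of D(P || Q), P = N(a, sigma^2), Q = N(b, sigma^2)."""
    lo = a - half_width * sigma
    h = 2 * half_width * sigma / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        p = normal_pdf(x, a, sigma)
        q = normal_pdf(x, b, sigma)
        if p > 0.0 and q > 0.0:  # skip points where a density underflows
            total += p * math.log(p / q) * h
    return total


def kl_closed_form(a, b, sigma):
    """Closed form of the same divergence for equal-variance Gaussians."""
    return (a - b) ** 2 / (2 * sigma ** 2)


print(kl_numeric(0.3, 1.1, 0.7), kl_closed_form(0.3, 1.1, 0.7))
```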



Lemma 12 (Fano's lemma). Let $X \in \mathbb{R}^d$ have distribution $P_i$, for some
$i \in \{1, \dots, k\}$, and let $\psi(X)$ be an estimate of $i$. Then

log (A; - 1) 

where 

1 k 

We also make the definition 



$$\|f\|_{I,\infty} := \sup_{x \in I} |f(x)|,$$

for functions $f : [0, 1] \to \mathbb{R}$, and $I \subseteq [0, 1]$.

Proof of Theorem 5. The argument proceeds as a standard minimax lower
bound; we construct functions $f_{n,k} \in \mathcal{F}$ a distance $c_n$ apart, and show we
can only distinguish between them when $c_n \gtrsim (n/\log(n))^{-\alpha(s)}$.
Choose $j \in \mathbb{N}$ so that

$$2^j \sim (n/\log(n))^{1/(2s+1)},$$
and define a sequence $j_1, \dots, j_m$ by

3m ■= j, ji-i ■= [tji\ , ji € {jo, ■ ■ ■ , |"f 1 _ 

Since there are $n$ design points $x_i$, for $n$ large we must have an interval
$S_{j_{m-1},k} \subseteq I$ containing $\lesssim 2^{-j_{m-1}} n$ of them; we will assume without loss of
generality that this interval is always $S_{j_{m-1},0}$. We then consider functions
$f_{n,k} \in \mathcal{F}$, given by

$$f_{n,k} := \sum_{i=1}^{m-1} M 2^{-j_i(s+\frac12)} \psi_{j_i,0} + C 2^{-j_m(s+\frac12)} \psi_{j_m,k},$$

where $k \in \{0, \dots, 2^{j_m - j_{m-1}} - 1\}$, and $C \in (0, M]$ is a constant to be
determined.

Suppose $\|\hat f_n - f\|_{I,\infty} = O_p(c_n)$, uniformly over $\mathcal{F}$, for a sequence $c_n \not\gtrsim
(n/\log(n))^{-\alpha(s)}$. Then on a subsequence, we have $c_n = o((n/\log(n))^{-\alpha(s)})$;
passing to the subsequence, we may assume this is true for all $n$. Since the
$\psi_{j_m,k}$ have distinct supports, for $n$ large we have

$$\min_{k \ne k'} \|f_{n,k} - f_{n,k'}\|_{I,\infty} \gtrsim C 2^{-j_m s} \gtrsim C (n/\log(n))^{-\alpha(s)}.$$

Thus for $n$ large, $\hat f_n$ can distinguish between the $f_{n,k}$ with arbitrarily high
probability. Let $P_k$ denote the distribution of the observations when $f =
f_{n,k}$; then any estimate

$$\hat k_n := \arg\min_k \|\hat f_n - f_{n,k}\|_{I,\infty}$$

of $k$ satisfies

$$\sup_k P_k(\hat k_n \ne k) \to 0$$

as $n \to \infty$.

However, for $k, k' \in \{0, \dots, 2^{j_m - j_{m-1}} - 1\}$, the Kullback-Leibler
divergence from $P_k$ to $P_{k'}$ is

$$D(P_k \| P_{k'}) = \sum_{i=1}^n \frac{1}{2\sigma^2} (f_{n,k}(x_i) - f_{n,k'}(x_i))^2$$

$$= \frac{1}{2\sigma^2} C^2 2^{-j_m(2s+1)} \sum_{i=1}^n (\psi_{j_m,k}(x_i) - \psi_{j_m,k'}(x_i))^2,$$

i=l 
Thus as $\lesssim 2^{-j_{m-1}} n$ design points lie within $\bigcup_k S_{j_m,k} \subseteq S_{j_{m-1},0}$,

$$2^{-2(j_m - j_{m-1})} \sum_{k,k'} D(P_k \| P_{k'}) \lesssim C^2 2^{j_{m-1} - j_m(2s+1)} \sum_{i=1}^n \sum_k 1(x_i \in S_{j_m,k})$$

$$\lesssim C^2 2^{-j_m(2s+1)} n \lesssim C^2 \log(n).$$

As there are $2^{j_m - j_{m-1}}$ alternatives for $k$, and $j_m - j_{m-1} \gtrsim \log(n)$, when $C$
is small this contradicts Fano's lemma. □

We next provide similar lower bounds for adaptive-sensing algorithms. In 
this case, the argument from Fano's lemma presents difficulties; instead, we 
will argue using Assouad's lemma, which bounds the accuracy of estimation 
over a cube in terms of Kullback-Leibler divergences (see Tsybakov, 2009, 
§2.7.2). While this choice leads to the loss of a log factor in the results 
proved, it allows us to give bounds which apply also for adaptive sensing. 

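Lemma 13 (Assouad's lemma). Let $\Omega := \{0, 1\}^m$, and for $p \in (0, \frac12]$, define
a distribution $\pi$ over $\Omega$,

$$\pi(\omega) = p^{\sum_i \omega_i} (1 - p)^{\sum_i (1 - \omega_i)}.$$

For each $\omega \in \{0, 1\}^m$, let $\mathbb{P}_\omega$ be a probability measure on $\mathbb{R}^d$, and $\mathbb{E}_\omega$ the
corresponding expectation. Then for any estimator $\hat\omega$ of $\omega$,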



7r(w)E aj p((i, uj) > pm I 1 



1 m 



2m 

1=1 u6$) 



where $\rho(\omega, \omega')$ is the Hamming distance, and $\omega^i$ equals $\omega$ except in the $i$th
coordinate.

Our argument then proceeds as in Arias-Castro et al. (2011); we will 
start with a simple lemma on the truncated expectation of binomial random 
variables. 

Lemma 14. If $X \sim \mathrm{Bin}(n, p)$, then $\mathbb{E} X 1(X > 2np) = o(np)$ as $np \to \infty$.
Proof. Considering the mass function of $X$, we have

$$\mathbb{E} X 1(X > 2np) = np \, \mathbb{P}(Y > 2np - 1),$$

where $Y \sim \mathrm{Bin}(n - 1, p)$. From Chebyshev's inequality, we then obtain

$$\mathbb{P}(Y > 2np - 1) \lesssim \frac{1}{(n - 1)p} \to 0. \qquad □$$
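The identity used in this proof follows from $x \binom{n}{x} = n \binom{n-1}{x-1}$, and can be checked numerically. The sketch below sums the binomial mass function directly; the function names are illustrative, and the chosen values of $n$ and $p$ are arbitrary.

```python
from math import comb, isclose


def binom_pmf(n, p, x):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def truncated_mean(n, p):
    """E[X 1(X > 2np)] for X ~ Bin(n, p), summed from the mass function."""
    return sum(x * binom_pmf(n, p, x) for x in range(n + 1) if x > 2 * n * p)


def shifted_tail(n, p):
    """n p P(Y > 2np - 1) for Y ~ Bin(n - 1, p)."""
    tail = sum(binom_pmf(n - 1, p, y) for y in range(n) if y > 2 * n * p - 1)
    return n * p * tail


# Both sides of the identity agree, and the ratio to np shrinks as np grows.
for n, p in [(37, 0.3), (50, 0.1), (500, 0.1)]:
    print(n, p, truncated_mean(n, p), shifted_tail(n, p), truncated_mean(n, p) / (n * p))
```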

We next give a lemma which allows us to control the performance of 
adaptive-sensing algorithms. 
Lemma 15. Given sequences $j_n, k_n, l_n \in \mathbb{N}$, with $2^{j_n} \ge k_n \ge l_n \to \infty$, $l_n =
o(k_n)$, let $K_n := \{0, \dots, k_n - 1\}$, and $\mathcal{K}_n := \{K \subseteq K_n : |K| \le l_n\}$. Given also
functions $f_n \in L^2([0,1])$, and a sequence $\mu_n$ satisfying $0 < \mu_n \le C\sqrt{k_n/n}$
for $C$ small, define

$$f_{n,K} := f_n + \sum_{k \in K} \mu_n 2^{-j_n/2} \psi_{j_n,k}, \qquad K \subseteq K_n.$$

Finally, let $I \subseteq [0, 1]$ be an interval satisfying $\bigcup_{k=0}^{k_n-1} S_{j_n,k} \subseteq I$, for $n$ large.
Suppose that an adaptive-sensing algorithm with estimator $\hat f_n$ satisfies

$$\|\hat f_n - f_{n,K}\|_{I,\infty} = O_p(c_n),$$

uniformly over $K \in \mathcal{K}_n$. Then $c_n \gtrsim \mu_n$.

Proof. We have 

since for $n$ large, $f_{n,K} - f_{n,K'}$ must be given by a single wavelet on some
interval $2^{-j_n}[l, l + 1) \subseteq I$. Suppose $\|\hat f_n - f_{n,K}\|_{I,\infty} = O_p(c_n)$, uniformly over
$\mathcal{K}_n$, for a sequence $c_n \not\gtrsim \mu_n$. On a subsequence, we have $c_n = o(\mu_n)$; passing
to that subsequence, we may assume this is true for all $n$. Any estimate

$$\hat K_n := \arg\min_{K \in \mathcal{K}_n} \|\hat f_n - f_{n,K}\|_{I,\infty}$$

of $K$ then satisfies

$$\sup_{K \in \mathcal{K}_n} \mathbb{P}_K(\hat K_n \ne K) \to 0$$

as $n \to \infty$; we will show this contradicts Assouad's lemma.

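Define a distribution $\pi$ over $K \subseteq K_n$, letting the variables $1(k \in K)$,
$k \in K_n$, be i.i.d., so that

$$|K| \sim \mathrm{Bin}\left(k_n, \frac{l_n}{2k_n}\right).$$

Denote by $\mathbb{E}_\pi$ the expectation when we first draw $K$ according to $\pi$, and then
observe under $\mathbb{P}_K$. Since $|\hat K_n \Delta K| \le l_n + |K|$ (where $\Delta$ denotes symmetric
difference), we have

$$\mathbb{E}_\pi |\hat K_n \Delta K| \le 2\,\mathbb{E}_\pi |K| 1(|K| > l_n) + l_n \sup_{K \in \mathcal{K}_n} \mathbb{P}_K(\hat K_n \ne K)$$

$$= o(l_n),$$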
using Lemma 14.

However, for K C K n , we also have 



Kfl 1 1 ^ 1 

E E n D(F K || P^Alfe}) = E E -E 2^2 (/«,*-(*) - /n,ifA{fe}(^))' 
fc=0 k=0 i=l 

n k n — l 



i=l k=0 

< C 2 k n . 

Thus for $C$ small, by Assouad's lemma,

$$\mathbb{E}_\pi |\hat K_n \Delta K| \gtrsim l_n,$$

giving us a contradiction. □

We may now proceed to prove our adaptive-sensing lower bounds, ap- 
plying this lemma in several different contexts. 

Proof of Theorem 7. We first prove c n > n- a{T '\ Choose j G N so that 

2^^/(2^), 

and define a sequence jx,..., j m by 

jm ■= 3, ji-i ■= [tji\ , ji G {jo, • • • , [f 1 - !}• 
We consider functions 

m-l 

fn,K ■■= E E M2^( r +I)v^, fc + E C2^('-'+l)^ mifc , 
i=l fc=0 k&K 

where if C := {k G {0, . . . , 2 im - 1} : S jm>k C /}, and C G (0, M] is 
small. Let 

/C„ := {K Q K n : \K\ < 2*1™} , 
so if X G /C n , then G J 7 . Applying Lemma 15, we obtain 

cn>n- a( - r '\ 

To show c n > n~ a ( s \ we make a similar argument, this time setting 

2 3 ~ n l/(l-pi«)(2s'+l) ) 

and defining $j_1, \dots, j_m$ as before. For $n$ large enough, we must have an
interval $S_{j_m,k} \subseteq I$; we will assume without loss of generality this interval is
$S_{j_m,0}$. We then consider functions

m-l 2^( 1 -P U '-1 

fn,K := E E M2~^)^ k + E M £n 2-^( s+ 5)^ m , fe , 

i=l fc=0 keK 
for K C K n := {0, . . . , 2-»' m ipu - 1} , and e n G (0, 1] a sequence to be 
determined, with e n —> 0. Let 

Kn := {KCK n : \K\ < e~P} , 

so again if K G /C n , G T . For any e n decreasing slowly enough, we may 
apply Lemma 15, obtaining c n > e n n~ a<yS ); we must therefore have 

Proof of Theorem 2. This follows as a corollary of Theorem 7. We apply 
the theorem to classes 

T = B^ 00 (M)nDf(J), 
noting that T C C S (M) C C S (M, J). □ 

Proof of Theorem 3. This follows as in the second half of Theorem 7. If we 
instead set 2? ~ n 1 /(2s+i) ) 

/n,K := £ Me n 2- J "(-+I)^. fcj 
and := {/c € {0, . . . , 2- 7 — 1} : Sj^ C I}, by the same argument we have 

A.2 Constructive results

We now prove that Algorithm 1 attains near-optimal rates of convergence. 
Our proofs involve a series of lemmas; the first shows that the algorithm 
chooses design points so that the discrepancy from the target density $p_m$, to
the effective density $q_m$, remains bounded.

Lemma 16. Let the design points x n be chosen by Algorithm 1, and suppose 

l + 2 C<^- < D, 
n m -i 

for constants C, D > 0. Then for m larger than a fixed constant, 

C C 

qi,m > pPl,m > Tj A)y - 

Proof. Suppose at stage m, the new design £ m included dyadic grids 2~ l Zn 
Il m , where the indices i were chosen to ensure that there were at least 

• max 

n, m '■= Cn m -i2 pi m 
such points in the interval Ii m , 2 l ~i™™ > r\ m . We will show that for this 
design, the discrepancy from p m to q m is bounded, and that Algorithm 1 
must do at least this well, 

Since pi^m ^ A\, for m large we have ri m > 1. This design would thus 
include the points 2~ 3n ™ Z n [0, 1), and would require at most 



■max 
2 3 "m — 1 



'y ] 2 r i,m — 2Cn m _i < n m — n m -i 

1=0 

additional design points; we would then have 

-max 

^ m i"i,m t 

Ql.m > — > ~^Pl,m- 

n m D 

Since Algorithm 1 includes the points 2 _ -'™Z n [0, 1), and then chooses its 
remaining design points to minimise 

-max 
2 3n m -1 

max pi tm /qi, m , 

the same must be true for its choice of design points. The final inequality 
follows as pi jT n 

> AX. □ 

We next consider the operation of Algorithm 1 under a deterministic
noise model. Let $e_n$ be given by (9), and suppose that our estimates $\hat\alpha_{j_0,k}$,
$\hat\beta_{j,k}$ are instead chosen adversarially, subject to the conditions that

$$|\hat\alpha_{j_0,k} - \alpha_{j_0,k}| \le e_n(j_0, k), \qquad |\hat\beta_{j,k} - \beta_{j,k}| \le e_n(j, k), \qquad (15)$$

for all $k$, and $j \in \{j_0, \dots, j_n^{\max} - 1\}$. We then show that, in this model, the
target densities $p_m$ will be large in regions where the wavelet coefficients $\beta_{j,k}$
are large.

Lemma 17. In the deterministic noise model, let J- be given by (1), for p < 
oo. For any f £T,me N, j G {j , • • . ~ and k e {°« • • • > 2J ~ !}> 

suppose 

|/3j,fc| > (K + l)e nm „ 1 (i,A;). 
T/ien /or 1/ m C S^fc, we /mwe 



Pl,m r^. 



log(n m ) 2 
uniformly in f, m, j, and k. 
Proof. We first establish that, for non-thresholded coefficients, our estimates
$\hat\beta^T_{j,k}$ are of comparable size to the $\beta_{j,k}$. Let $n := n_{m-1}$. Firstly, we have

$$|\hat\beta_{j,k}| \ge |\beta_{j,k}| - e_n(j,k) \ge \kappa e_n(j,k),$$

so $\hat\beta^T_{j,k} \ne 0$. Suppose $|\hat\beta^T_{j,k'}| \ge |\hat\beta^T_{j,k}|$ for some $k'$. Then $\hat\beta^T_{j,k'} \ne 0$, so

$$|\beta_{j,k'}| \ge |\hat\beta_{j,k'}| - e_n(j,k') \ge (\kappa - 1) e_n(j, k'),$$

and

$$|\hat\beta_{j,k'}| \le |\beta_{j,k'}| + e_n(j,k') \lesssim |\beta_{j,k'}|.$$

We thus obtain that

$$|\beta_{j,k'}| \gtrsim |\hat\beta^T_{j,k'}| \ge |\hat\beta^T_{j,k}| \ge |\beta_{j,k}| - e_n(j,k) \gtrsim |\beta_{j,k}|.$$

We may then conclude that, for such coefficients, the noise has little 
effect on the target density. Since / G J, the number of coefficients Pj k / for 

which |/3j fc ,| > \pj >k \ is thus < 2- jp ( s+1 *-r)\P jtk \- p , and for any I l>m C S jyk , 
the target density 

uk\ 



As $j_n^{\max} \lesssim \log(n_m)$, and these bounds are uniform over $f$, $m$, $j$, and $k$, the
result follows. □

Next, we prove a technical lemma, which shows that each term in the
wavelet series of $f$ will lie within the support of larger terms at lower
resolution levels. For a given function $f$, if $j, k, j', k'$ satisfy (4), we will say that
$\beta_{j',k'}$ is a parent of $\beta_{j,k}$.

Lemma 18. Let J- be given by (1), and pick j™ in £ N, 

x _4\ l/(2tr+2(l-t)s+l) 



•min / . , , ]_ ^ \ - 

2-Jn ^ I n / log(n) P I 



Then for any $f \in \mathcal{F}$, $n \in \mathbb{N}$, $j \in \{j_0, \dots, j_n^{\max} - 1\}$, and $S_{j,k} \cap I \ne \emptyset$, a
sequence of wavelet coefficients $\beta_{j_1,k_1}, \dots, \beta_{j_d,k_d}$ of $f$ satisfy:

(i) $\beta_{j_i,k_i}$ is a parent of $\beta_{j_{i+1},k_{i+1}}$;

(ii) $j_1 < j_n^{\min}$, $j_d = j$, $k_d = k$; and

(iii) $d$ is bounded by a fixed constant.
Proof. If $j < j_n^{\min}$, we are done. If not, since $f \in \mathcal{F}$, we must have a
coefficient $\beta_{j',k'}$ which is a parent of $\beta_{j,k}$. Choose $\beta_{j',k'}$ so that $j'$ is minimal,
and if also $j' \ge j_n^{\min}$, let $\beta_{j'',k''}$ be a parent of $\beta_{j',k'}$. We will show that we may
continue in this fashion until we choose a coefficient $\beta_{j_1,k_1}$ with $j_1 < j_n^{\min}$.
If $j' \ge j_n^{\min}$, we have that $j'' < j$, $S_{j'',k''} \supseteq S_{j,k}$, and

$$|\beta_{j'',k''}| \ge 2^{(j - j'')(s+\frac12)} |\beta_{j,k}|.$$

If also $j'' \ge tj$, this would make $\beta_{j'',k''}$ a parent of $\beta_{j,k}$, contradicting our
choice of $j'$. Thus $j'' < tj$. Since every two steps, we reduce $j$ by a factor of
$t$, and $j_n^{\max}/j_n^{\min}$ tends to a constant, it takes at most a constant number of
steps to reach $j_n^{\min}$. □

We may now show that, in this model, the algorithm will ensure all large 
coefficients are estimated accurately. 

Lemma 19. In the deterministic noise model, let the design points x n 
be chosen by Algorithm 1, and let J- be given by (1). For n = n m and 
C > large, not depending on f, the following results hold for all j E 
{jr n ,...,jT x -l}, andke{0,...,V-l}. 

(i) e n (j,k) < C(n/log(n))~a. 

(it) If f G J 7 , p < oo, and Sj : k D / ^ 0, define 

tr + (l-t)s + \ 
v := k , 

1 + - 

P 

and 

<5 n :=C(n/log(n) 3 )- 1/(p+2) . 

Then 

\/3 jt k\ > (« + l)2- ]V 5 n =► e„(j, fe) < 2~^ n . 



Proof. We consider the two parts separately. For part (i), for $n$ large we
may use the fact that the effective density is bounded below; we have that
by Lemma 16, $q_{l,m} \gtrsim 1$, so $2^{i_n(j,k)} \gtrsim n$, and

$$e_n(j,k) \lesssim (n/\log(n))^{-\frac12}.$$

The result thus holds for $C$ large.

For part (ii), we will argue that for $f \in \mathcal{F}$, large coefficients $\beta_{j,k}$ must
have large parents, which we can detect over the noise. We will thus place
more design points in their support, and so estimate the $\beta_{j,k}$ more accurately.

Suppose \Pj t k\ > (« + 1)2~ J '<5„, and apply Lemma 18 to fy^. We then 
obtain wavelet coefficients Pj 1 ,k 1 , ■ ■ ■ , fij d ,k d satisfying the conditions of the 
lemma, which we choose so that d is minimal; we proceed by induction on 
d. If d = l, then j <j™ in , and 

2-i v 5 n >2-^ av 6 n > (n/log(n))-|. 

For C large, the claim then follows from part (i). 

Inductively, suppose the claim holds for d — 1. If &| ^ (k ~i~ 1)2 ^ v 5 n , 
then we have 

|«. I > 2O'~^-i)( s +§-'02~^-i i; <$ > 2~ jd ~ lV () 

since 5 n > S n , m _ 1 , and 

s+ 1 - % - 
s+\-v> ,1p 2 >°- 

1 ' 2 

For n and C large, we may thus apply the inductive hypothesis at time 
n m _i, obtaining 

l&d-i.fcd-xl > (K + l)e nm _i(id-l,^-i)- 
We then apply Lemma 17 to Pj d _ 1 ,k d ^ 1 , obtaining that, for any Ii m C 

Sjd-i,k d -i ' 

>log(n)- 2 2^^+ 2 )|/3 ijA r 
> log(n)- 2 2 2 ^^. 

For n large, by Lemma 16 we will have 

«, m >log(n)- 2 2 2 ^ 

for such I. Since C S jd _ likd _ 1} we also have 2^'.*) > n log(n)- 2 2 2 ^<5£, 
and 

en(j,A:)<n-ilog(n)l2-^^ 
< 2-*6 n . 

Thus for C large, the claim is also proved for d. 

From Lemma 18, we know the number of induction steps is bounded by
a fixed constant, so there must be a single choice of $n$ and $C$ large enough
to satisfy all the above requirements. As this choice is also uniform over
$f \in \mathcal{F}$, the result follows. □

Using this result, we conclude that Algorithm 1 attains good rates of 
convergence over spatially-inhomogeneous functions. 
Lemma 20. In the deterministic noise model, let the design points $x_n$ be
chosen by Algorithm 1, $\mathcal{F}$ be given by (1), and $c_n$ by (3). Then

$$\sup_{f \in \mathcal{F}} \|\hat f_n - f\|_{I,\infty} = O(c_n).$$

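Proof. We may bound the error in $\hat f_n$ over $I$ by

$$\|\hat f_n - f\|_{I,\infty} \lesssim \max_k |\hat\alpha_{j_0,k} - \alpha_{j_0,k}| + \sum_{j=j_0}^{j_n^{\max}-1} \max_k 2^{j/2} |\hat\beta^T_{j,k} - \beta_{j,k}| + \sum_{j=j_n^{\max}}^{\infty} \max_k 2^{j/2} |\beta_{j,k}|.$$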

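In the deterministic noise model, $|\hat\alpha_{j_0,k} - \alpha_{j_0,k}| \le e_n(j_0,k)$, and the
thresholded coefficients $\hat\beta^T_{j,k}$ fall into one of two cases.

(i) If $\hat\beta^T_{j,k} = 0$, then

$$|\beta_{j,k}| \le |\hat\beta_{j,k}| + e_n(j,k) < (\kappa + 1) e_n(j,k),$$

so

$$|\hat\beta^T_{j,k} - \beta_{j,k}| = |\beta_{j,k}| \lesssim e_n(j,k).$$

(ii) If $\hat\beta^T_{j,k} \ne 0$, then

$$|\beta_{j,k}| \ge |\hat\beta_{j,k}| - e_n(j,k) \ge (\kappa - 1) e_n(j,k),$$

so

$$|\hat\beta^T_{j,k} - \beta_{j,k}| \le e_n(j,k) \lesssim |\beta_{j,k}|.$$

Thus, in either case, we have

$$|\hat\beta^T_{j,k} - \beta_{j,k}| \lesssim \min(e_n(j,k), |\beta_{j,k}|).$$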
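The two-case argument above can be checked by simulation. The sketch below assumes the threshold rule keeps an estimate when its absolute value is at least $\kappa$ times the noise level, and assumes $\kappa > 1$; under those assumptions the case analysis gives the explicit constant $\max(\kappa + 1, 1/(\kappa - 1))$. The function names, the sampling ranges, and the use of random rather than worst-case perturbations are illustrative choices, not part of the algorithm.

```python
import random


def hard_threshold(beta_hat, e, kappa):
    """Keep the estimate only if it is at least kappa times the noise level e."""
    return beta_hat if abs(beta_hat) >= kappa * e else 0.0


def worst_ratio(kappa, trials=100_000, seed=0):
    """Largest observed |thresholded estimate - beta| / min(e, |beta|)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        beta = rng.uniform(-5.0, 5.0)
        e = rng.uniform(0.01, 2.0)
        # Perturbation constrained as in the deterministic noise model.
        beta_hat = beta + rng.uniform(-e, e)
        err = abs(hard_threshold(beta_hat, e, kappa) - beta)
        bound = min(e, abs(beta))
        if bound > 0:
            worst = max(worst, err / bound)
    return worst


kappa = 2.0
print(worst_ratio(kappa), "should not exceed", max(kappa + 1, 1 / (kappa - 1)))
```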

For $n$ large, we may then bound the error in $\hat f_n$ using Lemma 19; since
the ratios $n_m/n_{m-1}$ are bounded, it suffices to consider times $n = n_m$. The
contribution from the $\hat\alpha_{j_0,k}$ is of order $(n/\log(n))^{-\frac12}$, so may be neglected.
Considering the $\beta_{j,k}$ terms, if $r < s$, from Lemma 19 and the definition of
our detectability condition, we obtain the bounds

$$e_n(j,k) \lesssim (n/\log(n))^{-\frac12}, \qquad |\beta_{j,k}| \lesssim 2^{-j(s+\frac12)}.$$

Pick $j_n \in \mathbb{N}$ so that

$$2^{j_n} \sim (n/\log(n))^{1/(2s+1)},$$
and we obtain

$$\|\hat f_n - f\|_{I,\infty} \lesssim \sum_{j=j_0}^{j_n-1} 2^{j/2} (n/\log(n))^{-\frac12} + \sum_{j=j_n}^{\infty} 2^{-js}$$

$$\lesssim (n/\log(n))^{-\alpha(s)}. \qquad (16)$$
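The balance behind (16) can be illustrated numerically. The sketch below evaluates the two sums for the stated choice of $j_n$ and divides by the target rate, under the assumption that $\alpha(s) = s/(2s+1)$, which is the exponent consistent with that choice of $j_n$ (the definition itself is given earlier in the paper). The function name, the default $j_0 = 5$, and the truncation of the infinite sum are illustrative.

```python
import math


def error_bound_ratio(n, s, j0=5, tail_levels=200):
    """Ratio of the two-sum bound to (n / log n)^(-s / (2s + 1))."""
    rate = n / math.log(n)
    # Level j_n with 2^{j_n} of order (n / log n)^{1 / (2s + 1)}, kept above j0.
    jn = max(j0 + 1, round(math.log2(rate) / (2 * s + 1)))
    noise_term = sum(2 ** (j / 2) * rate ** -0.5 for j in range(j0, jn))
    # Geometric tail of the bias term, truncated after tail_levels levels.
    bias_term = sum(2 ** (-j * s) for j in range(jn, jn + tail_levels))
    target = rate ** (-s / (2 * s + 1))
    return (noise_term + bias_term) / target


# The ratio stays bounded as n grows, as the final inequality in (16) requires.
for k in (10, 15, 20, 25, 30):
    print(k, error_bound_ratio(2 ** k, s=1.0))
```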

Assume instead we have $r > s$, so $p < \infty$. From Lemma 19, we then
obtain the additional bound

$$\min(e_n(j,k), |\beta_{j,k}|) \lesssim 2^{-jv} \delta_n.$$

If r' < s', then v > ^, so 

„'min i ;max i 

On ~ X 3n ~ A OO 

llA-/ll/,oo< E 2^(n/log(n))-|+ E 2- j ( V -^5n+ E 2 ^ 

J=i0 j=jn in 3=3%** 

^ — . i « min , , , v _1 _ , ) -max „ 

< 22^ (n/log(n)) 2 + 2 J « s 

< (n/logH 1 ^')'^ , 

which is 0(c n ), as in this case r' > ~. 

If instead r' > s', then v < \. Pick j n G N so that 

2^ -(n/logln) 3 ) 1 /^^^ 1 ), 

and we obtain 

ll/n " /H/,oo < E 2^(n/log(n))-i + E 2^-^ n + E 2" is 

J'=J0 0=0T n 0=jn 

< 2^'" in (n/log(n))-5 +2~^ s 



< 



V log(n) 



3 x -a(s') 



Similarly, if r' = s', then u = ^, and by the same calculation we obtain 

ll/n-/ll/,oo< (n/logH 3 )- a(s,) log(n). 
As these bounds are all uniform over / G J, the result follows. □ 

We next show that the conditions of the deterministic noise model are
satisfied with probability tending to one, uniformly over functions $f$ in
Hölder classes $C^{\frac12}(M)$.

Lemma 21. In the probabilistic noise model, let the design points $x_n$ be
chosen by Algorithm 1. Then at times $n = n_m$, there exist events $E_m$ on
which condition (15) holds with $\mathbb{P}(E_m) \to 1$; this convergence is uniform
over $f$ in classes $C^{\frac12}(M)$, for any $M > 0$.
Proof. We consider only the estimated coefficients $\hat\beta_{j,k}$; the result for the
$\hat\alpha_{j_0,k}$ follows similarly. We will show there exist events $E_m$ on which all
possible estimates $\hat\beta^i_{j,k}$, for all possible choices of design $\xi_n$, satisfy (15)
simultaneously.

Note that at time re, we estimate coefficients only up to level j'™ ax — 1. For 
3 S {jo, ■ ■ ■ , i™ ax — 1}, the quantities i n (j, k) must, by definition, be bounded 
above by a term i™ ax , with 2 ln := re 2 . Likewise, for n large, by Lemma 16 
the effective densities eft >m > 1, so the quantities i n (j, k) are bounded below 
by some i™ m , with 2 4 ™ m > n. 

For k G {0, . . . , 2*™ ax — 1}, generate observations 

Y k := f(k2-^) + e k , e k ^ N(0, a 2 ), 

and define estimates /3j fc in terms of the as in (6). For any choice of 
design points x n , our estimated coefficients /3j ik will be distributed as ^ k ,k \ 
for quantities i n (j, /c) G {«™ m , • • • , ^n ax } a l so depending on the 
Since / G C§ (M), for i G {C n , ■ • • , C ax }> we have 



a Jifc -2-2/(2- J A:) < J \f(x) - f(2- l k)\^ k (x) dx < 2'\ 
and so the estimates 

oj )fc ~ W (a i)fc + O^"*),^ - ') 
as n — > oo. Since each estimate /3* fc = 22i=o c l a \ii f° r a vec tor q with 

.... i 

|| Q || 2 = 1, by Cauchy-Schwarz ||q||i < 2a, and we obtain 
/3j, fe ~iv(^ ifc + o(2-|), ( 7 2 2- i 
For given z, j, k, the probability that 

1/8},* " %l > <72"V21og(n) (17) 

is thus 

2$ (-v/21og(re)+0(l)) < l/nv/log(n), 



using the fact that $\Phi(-x) \le \phi(x)/x$ for $x > 0$. By a simple union bound,
the probability that any $\hat\beta^i_{j,k}$ satisfies (17) is, up to constants, given by



uniformly over $f \in C^{\frac12}(M)$; the result follows. □
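The Gaussian tail inequality used in this proof is easy to check numerically. The sketch below verifies $\Phi(-x) \le \phi(x)/x$ at a few points, and evaluates the two-sided tail at $\sqrt{2 \log(n)}$ against $1/(n\sqrt{\log(n)})$. It is a simplification: the bias term inside $\Phi$ is ignored, and the function names are illustrative.

```python
import math


def std_normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * math.erfc(-x / math.sqrt(2))


def std_normal_pdf(x):
    """Standard normal density phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def exceedance_probability(n):
    """Two-sided tail 2 * Phi(-sqrt(2 log n)) of a standard normal."""
    return 2 * std_normal_cdf(-math.sqrt(2 * math.log(n)))


for x in (0.1, 1.0, 3.0, 6.0):
    print(x, std_normal_cdf(-x), std_normal_pdf(x) / x)

for k in (5, 10, 15):
    n = 2 ** k
    print(n, exceedance_probability(n), 1 / (n * math.sqrt(math.log(n))))
```

Multiplying the per-coefficient probability by the number of coefficients considered then gives the union bound referred to in the proof.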
Finally, we may combine these lemmas to prove our constructive results. 
Proof of Theorem 6. For $r > \frac12 + \frac1p$, the Besov classes $B^r_{p,\infty}(M)$ are
embedded within Hölder classes $C^{\frac12}(M)$. By Lemma 21, the conditions of the
deterministic noise model therefore hold at times $n = n_m$, with
probability tending to 1 as $m \to \infty$. In the proof of Lemma 20, we require those
conditions only at finitely many times $n_m, \dots, n_{m+d}$, with $d$ bounded by a
fixed constant; the conclusion of that lemma thus holds also with probability
tending to 1 as $n \to \infty$. □

Proof of Theorem 8. Given $f \in C^s(M, I)$, for $j$ large, and $k$ such that $S_{j,k} \cap
I \ne \emptyset$, we have $|\beta_{j,k}| \lesssim M 2^{-j(s+\frac12)}$. We note that, given this condition, we
may prove a bound

$$\|\hat f_n - f\|_{I,\infty} = O_p\left((n/\log(n))^{-\alpha(s)}\right)$$

similarly to (16); the result then follows as in the proof of Theorem 6. □

Proof of Theorem 1. The proof of Theorem 8 remains valid even if we set
$n = n_0$, in which case we are describing the performance of a standard
uniform-design wavelet-thresholding estimate. □